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Abstract 

Alternative splicing is the post- 
transcriptional process by which a single 
gene can produce multiple transcripts and 
thereby protein isoforms. The presence of 
different transcripts of a gene across samples 
can be analysed by whole-transcriptome mi- 
croarrays. Reproducing results from published 
microarray data represents a challenge due 
to the vast amounts of data and the large 
variety of pre-processing and filtering steps 
employed before the actual analysis is carried 
out. To ensure a firm basis for methodological 
development where results with new methods 
are compared with previous results it is crucial 
to ensure that all analyses are completely 
reproducible for other researchers. 

We here give a detailed workflow on 
how to perform reproducible analysis of 
the GeneChip® Human Exon 1.0 ST Ar- 
ray at probe and probeset level solely in 
R/Bioconductor, choosing packages based on 
their simplicity of use. To exemplify the use 
of the proposed workflow we analyse differen- 
tial splicing and differential gene expression in 
a publicly available dataset using various stat- 
istical methods. 

We believe this study will provide other re- 
searchers with an easy way of accessing gene ex- 
pression data at different annotation levels and 
with the sufficient details needed for developing 
their own tools for reproducible analysis of the 
GeneChip® Human Exon 1.0 ST Array. 
Contact: maria@math.aau.dk 

1 Introduction 

In the field of microarrays it has traditionally been 
difficult to compare new methods to methods in pub- 
lished papers as the many methods available for pre- 
processing, summarizing and filtering make it almost 
impossible to work with the exact same data, even 
when the raw data is made available. That is why we 
consider reproducible research fundamental as it will 
facilitate easy 1) revision of papers, 2) access to data 



and results, 3) communication to other researchers 
and 4) comparison between different methods. Re- 
producible research is gaining relevance among the 
scientific community as shown by the number of pa- 
pers published on the subject during the last years 
[H [31 0]. Ioannidis et al. showed that the res- 
ults of only 2 out of 18 published microarray gene- 
expression analyses were completely reproducible [S] . 
This is why some authors demand that document- 
ation and annotation, database accessions and URL 
links and even scripts with instructions are made pub- 
licly available [2J. Journals like Biostatistics have 
even appointed an Associate Editor for reproducible 
research, but still treat it as a "desirable goal" rather 
than a requirement [BJ. Setting up a framework 
for reproducible research necessarily implies working 
with free and open-source software as for example 
R/Bioconductor [71 [S]- Additionally, using Sweave 
9J (a tool for embedding R code in IATj^Xdocuments 
[11]). enables automatic reports that can be updated 
with output from the analysis. 

The main tool in this paper will be the Biocon- 
ductor package aroma. affymetrix |12j that can 
analyse all Affymetrix chip types with a (.CDF) 
Chip Definition File. The number of arrays 
(samples) that can be simultaneously analysed by 
aroma. affymetrix is virtually unlimited as the sys- 
tem requirements are just 1GB RAM, for any operat- 
ing system [13] , This package is freely available and 
can easily be installed into Ft. The aroma, affymetrix 
website www. aroma- project . org is conceived as ref- 
erence for all the possible microarrays that can be 
analysed with aroma. affymetrix, and does not fo- 
cus specifically on the analysis of the GeneChip® 
Human Exon 1.0 ST Array (or exon array in short). 
Portable scripts for a fast and basic analysis can be 
obtained on request to aroma, affymetrix's authors. 

The analysis of exon array data in R/Bioconductor 
is not yet standard. There are several packages 
available, and it can be a tremendous effort for 
a newcomer to maneuver between them, and to 
overcome the numerous challenges associated with 
these packages. This paper aims to make this task 
easier and to provide a quick reference guide to 
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aroma. affymetrix's documentation. We also ex- 
plain how to extract data for different statistical ana- 
lyses and propose a method for gene annotation and 
for gene profile visualization. For this last step, we 
use the packages biomaRt and GenomeGraphs to an- 
notate and visualize the transcripts in a genomic con- 
text. 

The analysis workflow presented in this paper is 
carried out solely in R/Bioconductor, and the paper 
is available as a Sweave (.Snw) document that will 
allow the reader to reproduce our exact results. The 
.Snw document can also be converted into an R script 
and executed. The workflow starts by reading in the 
data, followed by background correction and quantile 
normalization. We then explain how to obtain tran- 
script cluster -, probeset -, and probe-level estimates. 
Afterwards, different methods for the statistical ana- 
lysis of differential splicing or differential gene expres- 
sion are reviewed. Finally, we make a suggestion on 
how to annotate transcript clusters to genes in the 
lists obtained from the statistical analyses, and how 
to plot the data including genomic information. To 
exemplify the use of the workflow, an example data- 
set [H] is analysed along the way. 

2 Background on alternative 
splicing 

Splicing is the post-transcriptional process that gen- 
erates mature eukaryotic mRNAs from pre-mRNAs 
by removing the non-coding intronic regions and join- 
ing together the exonic coding regions. For many 
genes, two or more splicing events take place during 
maturation of mRNA molecules resulting in a cor- 
responding number of alternatively spliced mRNAs. 
These mature mRNAs translate into protein isoforms 
differing in their amino acid sequence and ultimately 
in their biochemical and biological properties [T5lll6| . 
Alternative splicing is one of the main tools for gen- 
erating RNA diversity, contributing to the diverse 
repertoire of transcripts and proteins |16l I17j . It is 
known that 92-94% of multi-exon human genes are 
alternatively spliced and that 85% of those have a 
minor isoform frequency of at least 15% [T51 UH] ■ In 
our case we will focus on the detection of differential 
splicing between groups, as for instance tissue types, 
or healthy vs. diseased samples. 

The exon array was presented in October 2005 
as a tool for the analysis and profiling of whole- 
transcriptome expression [501 [5TJ [55] . To interrogate 
each potential exon with at least one probeset, the 
exon array contains about 5.6 million probes grouped 
into more than 1.4 million probesets (most probesets 
consisting of 4 probes), which are further grouped 



into 1.1 million exon clusters, or collections of over- 
lapping exons. Finally, exon clusters are grouped into 
over 300.000 transcript clusters to describe their re- 
lationship, as for example shared splice sites or over- 
lapping exonic sequences. Each gene is covered, on 
average, by 40 probes interrogating regions located 
along the entire gene [23] . This probe positioning 
aims at providing better estimates of gene expression 
levels than previous arrays, and allows for the study 
of differential splicing [53] and differential gene ex- 
pression based on summarized exon expression. 

The exon array has three levels of annotation for 
the interrogated transcript clusters: core, extended 
and full [55]. Core transcript clusters are suppor- 
ted by the most reliable evidence such as RefSeq 
transcripts and full-length mRNAs [26] and a core 
transcript cluster is roughly a gene [57]; the exten- 
ded level contains the core transcript clusters plus 
cDNA-based annotations [55J and the full level con- 
tains the two previous levels plus ab-initio, or al- 
gorithmic, gene predictions [29]. It is worth noting 
that aroma. affymetrix enables the analysis at the 
three levels of annotation mentioned above, and also 
that it provides intensity estimates for probes, probe- 
sets and transcript clusters, allowing for a variety of 
options for the analysis. 

3 Workflow 

Our workflow for the analysis of exon array data 
starts setting up the required folder structure for 
aroma. affymetrix. The data is then preprocessed 
and summarised at transcript cluster and/or probe- 
set level. Next, transcript clusters are analysed with 
several statistical models to detect differential expres- 
sion or splicing and the transcripts of interest are an- 
notated and visualised at the end, see Figure[l] In the 
code, places where user input is needed are marked by 
and places where the user can choose whether 
to modify parameters are marked by "**". 

To exemplify the use of the tutorial we have used 
Affymetrix's colon cancer data set [13], consisting of 
a collection of paired samples of colon tumour tissue 
and adjacent normal tissue from 10 patients and 
available at http : //www . affymetrix . com/support/ 
technical/sample_data/exon_array_data. af f x. 
According to Affymetrix's website, the RNA samples 
are from a commercial source. This dataset has been 
used in a number of papers to evaluate the perform- 
ance of different analysis methods [301 1311 1321 [33] and 
a number of genes have been validated to present 
differential splicing or not [14]. The analysis was 
done in R version 2.15.1 (32 bit). 

Start by installing and loading aroma. affymetrix 
in R and loading the other libraries required: 
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Figure 1: Flowchart for an analysis with aroma. affymetrix, read counter-clockwise starting in upper left 
corner: 1. and 2. folder structure set-up including library, annotation and .CEL files, 3. data pre-processing 
and summarization, 4. extraction of intensities at transcript cluster, probeset and probe level, including 
filtering recommended by Affymetrix, 5. statistical analysis of differentially expressed or spliced transcript 
clusters and 6. annotation and visualization of transcript cluster profiles. Blue boxes represent parts of 
the analysis implemented in aroma. affymetrix, yellow and green boxes are part of the code provided in 
this paper, and purple boxes represent user-input needed. Output produced at several steps is saved in 
user-chosen "output .folder" and represented by a star shape in the workflow. A number of folders are 
automatically generated by aroma, affymetrix - represented by a faded yellow rectangle in 3. -, our workflow 
does not make use of the contents of such folders. 
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> source ("http: / /aroma-project . org/hbLite . R") 

> hblnstall (" aroma . affymetrix" ) 

> require (aroma . affymetrix) 

> require (biomaRt ) 

> require (GenomeCraphs) 

3.1 Setting up the structure and files 
for the analysis workflow 

This section corresponds to steps 1. and 2. 
in Figure [I] The first step is to create the 
folder structure: under a main folder of our 
choice - "myworkingDirectory" - we will cre- 
ate the "rawData" and "annotationData" folders, 
which will be common to all aroma, affymetrix 
projects. Inside "annotationData", the subfolder 
"chipTypes" will contain one subfolder per chip type, 
with the exact name of the .CDF file provided by 
Affymetrix, "HuEx-l_0-st-v2" in our case. Inside 
this folder we will save any library and annotation 
files that might be needed. Besides, the "myDataSet" 
folder will be created under "rawData" to store .CEL 
files. These files are the output of a microarray ex- 
periment and contain the result of the intensity calcu- 
lations per probe or pixel. Note that the microarray 
experiment produces one .CEL file per array and that 
one array analyses one sample. Affymetrix sometimes 
refer to their microarrays as "chips". Note also that 
we need one "myDataSet" folder per experiment and 
that "myDataSet" will be added as a tag at the end 
of the aroma, affymetrix output. 

> #*** user-defined working directory 

> wd <- "myWorkingDirectory" 

> #*** user-defined data set name 

> ds <- "myDataSet" 

In the second step we save our library (Chip 
Definition File (.CDF) in our case) and .CEL 
files in the corresponding folders. Affymetrix's 
unsupported .CDF files can be downloaded 
from http : //www. affymetrix . com/Auth/ support/ 

downloads/library_f iles/HuEx- l_0-st-v2 . cdf .zip, 
note that registration is needed. For the exon array, 
Elizabeth Purdom has created a number of binary 
.CDF files based on Affymetrix's text .CDF file 
[13] that are faster to query and more memory 
efficient. Such binary .CDF files for core, extended 
and full sets of probesets can be downloaded from 
http://aroma-project.org/node/122. In the 
example below we use the custom aroma file for core 
transcript clusters, which might be updated in the 
future. Our original .CEL files will be copied from 
the user-specified "myCELf ileDirectory" into the 
exon "rawData" subfolder (the code is part of the 
.Snw version of this paper). The desired output 
folder specified in "output . folder" should exist in 
advance. 



> #** download user-defined library file 

> library . file <- 

+ paste (annotation . data . exon, 

+ "HuEx-l_0-st-v2, coreR3, A2 007 11 12, EP .cdf" , 

+ sep = "/") 

> download . address <— 

+ "http : / /bege . Ibl . gov/cdfFiles/" 

> file <- paste ("HuEx-l_0-st-v2, A20071112, EP", 

+ "HuEx-l_0-st-v2, coreR3,A20071112, EP .cdf" , 

+ sep = "/") 

> custom. cdf <— 

+ paste (download . address, file, sep = "") 

> download . file (url = custom, cdf, 

+ destfile = library . file, 

+ mode = "wb", quiet = FALSE) 

> user-defined directory containing .CEL files 

> eel . directory <- "myCELf ileDirectory " 

> #*** user-defined output folder 

> output . folder <- "output . folder " 



/") , 



> transcript . clusters .NetAf fx. 32 <- 

+ read, csv (file = paste (annotation . data . exon, 

+ "HuEx-l_0-st -v2 . na32 . hgl 9 . transcript . csv" , 

+ sep = "/"), skip=24) 

> probesets .NetAf fx . 32 <- 

+ read. csv (file = paste (annotation . data . exon, 

+ " HuEx-l_0-st-v2 . na32 . hgl 9 . probeset . csv" , 

+ sep = "/"), skip=23) 



Besides, the sample information should be saved 
in a tab separated file with column names 
celFile, replicate, and treatment containing 
.CEL file name (without .CEL), replicate identi- 
fier and treatment name, respectively. This file 
should be called "SampleInformation.txt" and it 
will be copied from the user-specified directory into 
the "\rawData\myDataSet \HuEx-l_0-st-v2" folder 
(which will also contain the .CEL files) after it has 
been created. The sample information file for the 
colon cancer example is attached as an additional file. 

> sample . in fo <— 

+ read . table ( file = paste (raw . data . exon, 

+ "SampleInformation.txt", sep = " 

+ sep = "\t", header = TRUE) 

Finally, NetAffx transcript clusters' and 
probesets' annotation files should be saved in 
"annotationData/ chipTypes/" "HuEx-l_0-st-v2" . 
We have used release 32, which was most up to 
date at the time of writing and we downloaded files 
"HuEx-l_0-st-v2.na32.hgl9.transcript.csv.zip" and 
"HuEx-l_0-st-v2.na32.hgl9.probeset.csv.zip" from 

http : //www . affymetrix . com/estore/browse/products . 
j sp?productId=131452&categoryId=35676&productName= 

GeneChip-Human -Exon-ST-Array#i_3[ Technical Docu- 
mentation tab, under NetAffx Annotation Files. The 
extracted .csv files should be converted into .Rdata 
files for querying them faster in the future. Note 
that the number of lines to skip might differ for 
future annotation files. 
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3.2 Data pre-processing and sum- 
marization to probeset/transcript 
cluster level 

After denning chip type and dataset, background cor- 
rection and quantile normalization are carried out as 
shown in Figure [T] step 3. In these pre-processing 
steps, it is possible to use either Affymetrix's ori- 
ginal .CDF file, the .CDF hies provided by the 
aroma project (the hie for core transcript clusters in 
our example), or a .CDF hie created by the user. 
The summarization step, however, must be done us- 
ing one of the custom .CDF files available at the 
aroma, affymetrix project website. 

Background correction as defined by Ir- 
izarry [34] and quantile normalization are per- 
formed by the RmaBackgroundCorrectionO and 
QuantileNormalizationO functions, respectively. 
The raw, background corrected and quantile nor- 
malized probe intensities can be visualized using 
the plotDensity () function applied to the cor- 
responding object. Summarization is done with 
the ExonRmaPlm function 35^ The parameter 
mergeGroups determines whether to summarize at 
transcript level (TRUE) or probeset level (FALSE). 
All the functions described automatically create 
subfolders such as "plmData" or "probeData" inside 
"myWorkingDirectory" . A more detailed version 
of this code with interesting comments about the 
choice of .CDF and possibilities for quality control 
is available at http://www.aroma-project.org/ 



The respective intensities are obtained by apply- 
ing the function getChipEf f ectSet () to the tran- 
script or probeset plm objects (plmTr and plmEx, 
respectively) and then extracting the correspond- 
ing dataframes. However, it is also possible to ex- 
tract the background corrected and quantile nor- 
malized intensities of all probes using the function 
getUnitlntensities. While plmTr is suitable for 
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> chipType <- "HuEx-l_0-st-v2 " 

> #** user-defined .CDF file: change tags paramet 

> cdf <- AffymetrixCdfFile$byChipType (chipType 
+ tags = "coreR3, A20071112, EP" 

> cs <- AffymetrixCelSet$byName(ds, cdf = cdf) 
> 

> # background correction 

> be <- RmaBackgroundCorrection (cs) 

> csBC <- process (be, verbose = verbose) 
> 

> # quantile normalization 

> qn <- QuantileNormalization (csBC, 

+ typesToUpdate = "pm") 

> csN <- process (qn, verbose = verbose) 
> 

> # summarization 

> # transcript cluster level 

> plmTr <- ExonRmaPlm (csN, mergeGroups = TRUE, 

+ tag = "coreProbesetsGeneExpressi 

> # probeset /exon level 

> plmEx <- ExonRmaPlm (csN, mergeGroups = FALSE, 

+ tag = "coreProbesetsExonExpressi 



3.3 Extraction of intensity estimates 

The aroma. affymetrix documentation focuses on 
analyses at the probeset and transcript cluster levels. 



the FIRMA analysis (3.4 1, plmEx is well suited for 
probeset-level analysis. For the linear model ana- 
lysis described in Section [374] we have created one list 
of dataframes containing probe intensities per tran- 
script cluster, and another list of dataframes contain- 
ing probeset intensities per cluster (code included in 
.Snw). 

> # extract a matrix of gene intensities 

> cesTr <- getChipEffectSet (plmTr) 

> trFit <- extractDataFrame (cesTr, addNames=TRUE) 

> # extract a matrix of probeset intensities 

> cesEx <- getChipEffectSet (plmEx) 

> exFit <- extractDataFrame (cesEx, addNames=TRUE) 

> # extract a list of probe intensities per gene 

> unitlntensities <— 

readUnits (csN, verbose=verbose) 

The high number of transcript clusters analysed in 
combination with the usually small number of chips 
tends to cause a high number of false positives [25] , 
In order to reduce the number of false positives, Affy- 
metrix recommends to perform detection above back- 
ground (DABG) [36] on the dataset prior to the ana- 
lysis [25]. The DABG procedure is not implemented 
in aroma. affymetrix, so we decided to follow the 
procedure described in [3U] and use 3 as a threshold 
for the probeset intensity, so that probesets with a 

e log 2 intensity below 3 will be marked as absent. Ex- 
cept for this change, we followed the guidelines pro- 
posed in |25j to remove absent transcript clusters and 
probesets, where neither probesets that are absent in 
more than half of the samples of a group nor tran- 
script clusters with more than half of the probesets 
absent are analysed. 

Besides this filtering based on expression levels, 
another filtering that removes probesets presenting 
cross- hybridization is also advisable [25J. Cross- 
hybridising probesets are identified in file "HuEx-l_0- 
st-v2.na32.hgl9. probeset. csv" and removed. Affyme- 
trix recommends to filter them out after the analysis, 

JftVt we have decided not to include them in the ana- 
lysis in order to narrow down the number of probe- 
sets/transcript clusters to investigate. 
""The filtering procedure is part of the .Snw file. In 
our example, where we analysed only core probesets, 
136233 probesets out of 284258 were deemed present 
by our filter, and the number of transcript clusters to 
analyse (present in both samples) was reduced from 
18708 to 8401. 



page 6 of 14 



Rodrigo-Domingo et al. 



3.4 Statistical analysis 

In this section we give an overview of model-based 
statistical methods available for the analysis of differ- 
ential splicing and suggest a method for the analysis 
of differential gene expression. 

3.4.1 Differential splicing 

The analyses carried out in this paper are model- 
based approaches, and the analysis of differential spli- 
cing is done genewise. The models we use are exten- 
sions of the linear model by Li and Wong [37] 



Vpi 



Pt 



-pt j 



(1) 



where y pt is the intensity measure of probe p for treat- 
ment t, a p is a probe affinity term, j3 t is the gene-level 
estimate for treatment t, and e p t is the error term. 

ANOSVA (Analysis of Splicing Variation) was 
presented by Cline et al. [38] and is a two-way AN- 
OVA model with probeset and treatment as factors: 



Vpet = M + a e + Pt + let + £pet, 



(2) 



where y pe t is the intensity measure of probe p in 
probeset e and treatment t, the overall mean /i is the 
baseline level of all probes in all experiments, and a e 
and /3t are the probeset and treatment effects. The 
interaction term j et tells whether the effect of the 
probeset depends on the treatment and is therefore 
key to the detection of differential splicing. 

The model in Q can be extended to include ran- 
dom effects associated to replications from the same 
individual, r: 



fl + a e + Pt + let + Ir + C tr + £ 



petr j 



(3) 



where I r is the random effect of each individual r and 
Ctr is the random chip effect. The error terms e pe tr 
are 1.1. d. N(0,CT 2 )-distributed. 

Under the null-hypothesis of no differential spli- 
cing, the let's will all be zero, therefore we consider 
the test statistic 

, _ let 
let — 

a 

where i e t is the estimate for 7 et , and a is the stand- 
ard error of j et . Large values will be critical for the 
null hypothesis. Under the model assumptions, t will 
follow a tff-T-(n B +R-i) distribution [39, Chap. 5], 
with iV the total number of observations per tran- 
script cluster, T the number of treatments, n e the 
number of probesets in the transcript cluster and R 
the number of individuals. For each gene, the smal- 
lest p-value from the above t-tests is regarded as a 
measure of confidence that the gene is differentially 
spliced across the experimental conditions |38j . The 
interaction estimates and variances and thus the test 



statistics are contrast-dependent, so choosing a differ- 
ent contrast will alter the gene lists. In our analysis, 
we have used the sum contrast available in R, where 
parameter estimates are centred around zero. 

A package in R for fitting linear and generalized lin- 
ear mixed-effects models is lme4 [52). Linear mixed- 
effects models can also be analysed using lm, which 
is faster but requires a balanced design. In our ex- 
ample dataset we use lm to fit the model in equa- 
tion Q to the probe level estimates obtained us- 
ing unitlntensities () from aroma. affymetrix to 
8075 multiexonic transcript clusters. Here, we only 
show the code corresponding to equation Q , the rest 
of the code is part of the .Snw file. 

> # ** user-defined parameters for linear model 

> lm <- 

+ lm (intensity ~ probeset + treatment + 

+ C (probeset : treatment , contr.sum) 

+ replicate/treatment) 

> n. probesets <- length (unique (data frame$probeset ) ) 

> main . effects <— 

+ 1 + (n . probesets - 1) + 

+ (length (unique (dataf rame$treatment ) ) - 

> DS . parameters <- (n. probesets - 1) * 

+ ( length (unique (dataf rame$treatment) ) - 

> p.t <- 

+ min (summary (lm) $coefficients [ (main . effects+1 ) 

+ (main . effects + DS . parameter s ), "Pr (> / 1 / )"] ) 

Although the vast majority of probesets contain 4 
probes, transcript clusters containing probesets with 
less than 4 probes will give rise to an unbalanced 
design. Nevertheless, the t-distribution is almost a 
normal distribution for long transcript clusters so the 
unbalanced design does not have any practical im- 
plications. For shorter transcript clusters, however, 
an unbalanced design might be a problem. 

The top 10 most differentially spliced genes, sor- 
ted by the minimum p-value of their t-tests, appear 
in Table [Ta[ The gene ZAK was validated as dif- 
ferentially spliced in [H] . See Figure [2] for the pro- 
file plot of KLK10, where the thick lines representing 
the mean intensity in each group have been plotted 
for easing the interpretation. Note that there is one 
measurement per probe in each probeset, typically 4 
probes per probeset. How to obtain such plots is de- 
scribed in Section |3.5| below. The genes in Figures 



[2] [3] [4] and [6] were chosen because they span over a 
shorter genomic region and show a more clear picture 
of the relationship between probesets and exons than 
the other genes in the lists of top 10 genes. 

A slight variation of ANOSVA is the probeset- 
model as implemented in Partek [40 (note that the 
probe subscript p has been removed): 

Vetr = A« + a e + Pt + let + Ir + C tr + £etr- (4) 

For the probeset-level ANOSVA we used the 
probeset level estimates obtained by affyPLM(. . . , 
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mergeGroups = FALSE) . After filtering for non- 
present or cross-hybridising probesets, and absent 
transcript clusters, we were left with 9494 transcript 
clusters to study. These clusters are sorted accord- 
ing to the minimum p-value of the individual i-test 
scores for differential splicing, and the top 10 genes 
obtained appear in Table lb The genes MMP11, 



ZAK and COMP are in the top 10 genes for both the 
ANOSVA probe and the ANOSVA probeset models. 
Gene TGFBI appears in Figure [3j 

FIRMA (Finding Isoforms using Robust Multi- 
chip Analysis) was first introduced by Purdom et al. 
[30] for the exon array. In presence of differential spli- 
cing the model in ([TJ) will not fit and this will show 
up in the residuals. The linear model used is the 
following: 

Vpetr = a p + fitr + (5) 

with a p the probe affinity, f3tr the gene-level effect for 
chip tr and the error terms e pe t r are i.i.d. N(0, (re- 
distributed. Note that in contrast to the model in 
([3]) we do not compute an overall gene-level estimate, 
but a gene-level estimate per chip. 

As in model ([TJ), there is no treatment /probeset 
interaction term, so differential splicing is analysed 
probesetwise, using the residuals per probeset e: 



P=l, 



ipetr — Vpetr 

..n e ,t=l,.. 



- (dp + far) 
T r = 1 



(6) 



,R 



point at exons more differentially spliced between the 
two conditions. The resulting list was filtered to keep 
only probesets and transcript clusters that has passed 
the filter described in |3.3| above, the code is part of 
the .Snw file. In order to get a gene list instead of a 
probeset list as in |30j we mapped probesets to tran- 
script clusters and then selected the top 10 genes 
on the list, see Table [Tc] The profile plot of gene 
LGALS4 appears in Figure|4j The very high average 
difference for this gene is due to a FIRMA score of 
830030.8 at probeset 3861578 corresponding to the 
tumour sample of replicate 7. 

Out of 14 genes investigated for differential splicing 
in [21 Table 1] , 10 passed our filtering procedure and 
were analysed for differential splicing: ACTN1, VCL, 
CALD1, SLC3A2, COL6A3, CTTN. FN1, MAST2, 
ZAK and FXYD6. The gene ZAK appears in the top 
10 genes from ANOSVA probe and ANOSVA probe- 
set, but a plot is not produced automatically by the 
code because ZAK is not recognised by BioMart. In- 
stead, we looked the gene up in PubMed obtaining 
the Ensembl ID: ENSG00000091436, and used this 
to plot the gene, see Figure [5] The gene COL6A3 
appears among the top 100 genes for the ANOSVA 
probe and the ANOSVA probeset methods. Gene 
ACTN1 is among the top 100 genes for ANOSVA 
probeset and it is also the first of Gardina's genes to 
appear in the filtered FIRMA list, at position 154. 



where a p and /3 tr are the estimates of a p and /3 tr - 

The median of the standardized residuals per 
probeset per chip is chosen as score statistic: 



771 median 

r e(tr) p— l,...,n. 



(Tpetr \ 
MAD/ 



(7) 



The standardization with the median absolute de- 
viation (MAD) of the residuals per gene makes the 
scores comparable across transcript clusters. 

The FIRMA scores are extracted from the plmTr 
object obtained in Section 3.2 All probesets and 



transcript clusters were analysed by FIRMA, since 
it is part of the default aroma, affymetrix workflow. 

> firma <- FirmaModel (plmTr) 

> fit (firma, verbose = verbose) 

> fs <- getFirmaScores (firma) 

> firma . scores <- extractDataFrame (fs) 

After obtaining the FIRMA scores per probeset per 
sample, we proceeded as described in [30]: 1) For each 
probeset we took the difference of FIRMA scores for 
each of the 10 pairs of normal/cancer samples and 2) 
calculated the mean of the 10 differences per probe- 
set. Then 3) we ranked the probesets according to 
their absolute mean difference: as the scores are com- 
parable across transcript clusters, larger average dif- 
ferences between the normal and cancer samples will 



3.4.2 Differential gene expression 

In this section we analyse differential gene expres- 
sion using probe level data. We study two types 
of transcript clusters: 1) the ones not included in 
the ANOSVA probe analysis above (with only one 
probeset or not present in both normal and cancer 
groups) and 2) the ones not showing differential spli- 
cing (ANOSVA probe p- value above 0.1). The ana- 
lysis of group 2) is based on the hierarchical principle: 
only look for significant main effects (differential ex- 
pression in this case) among those transcript clusters 
with no significant interaction terms (differential spli- 
cing) 46, p. 427]. In total, we analysed 1G231 tran- 
script clusters. We fit the following linear model to 
those transcript clusters: 



Vpetr = A* + "e + fit + I r + C tr + £ 



petr 1 



(8) 



where /3t is the gene level treatment effect, a e is a 
parameter that captures probesets expressed above 
or below the overall transcript cluster level and I r 
and Ctr are random effects for patient and chip, re- 
spectively. 

> aov <- 

+ aov (intensity ~ probeset + treatment + 

+ Error (replicate + replicate : treatment) ) 
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Ggiig symbol 


min p- value 




Gsn.6 symbol 


min £)-v£ilu.c 


MMPll 


2.98e - 10 




SOX4 


9.07e - 15 


SOX9 


2.59e - 09 




MMPll 


9.48e — 14 


FOXQ1 


6.38e - 09 












FOXQ1 


5.98e - 13 


SYNM 


4.53e - 07 








PHLDA1 


4.85e - 07 




UBAP2L 


8.17e - 12 






COMP 


2.87e - 11 


SNTB1 


5.33e - 07 




SLC2A1 


4.16e - 11 


COMP 


6.85e - 07 




CDH11 


4.19e - 10 


XPOT 


7.87e - 07 




CPXM1 


4.47e - 10 



(a) Model: ANOSVA probe. (b) Model: ANOSVA probeset. 



Gene symbol 


p- value 


LARGE 


0.00287 


IL6R 


0.00847 


R.XR.G 


0.00847 


BAI3 


0.00968 


Clorfl75 


0.01090 


GRIK3 


0.01090 




KCNN3 


0.01090 


PLEKHH2 


0.01090 


HOXD10 


0.01090 



Gene symbol 


Average difference 


HMGCS2 


82.9 


RPS6KA1 


31.8 


MUC13 


28.2 


RPL35 


24.5 


RPL35A 


18.8 


SLC39A14 


18.2 


TGFBI 


16.2 


COL23A1 


15.5 


SFT2D2 


14.2 



(c) Model: FIRMA. 

Table 1: Top 10 differentially spliced genes from the 

Tables (a) and (b) show- 



models described in 3.4.1 



genes with minumum p-values, table (c) shows genes 
with their top scores. Genes highlighted in gray ap- 
pear in Figures [2j [3j [i] and [5] 



> p.F <- 

+ summary (aov) $ "Error : Within" [ [1] ] ["treatment' 
+ "Pr (>F) 

The null hypothesis is that the gene expression is 
the same in all groups. The top 10 genes, with ad- 
justed p- values appear in Table [2] The method used 
for adjusting the p-values was Benjamini-Hochberg's 
correction |47j using the function p . adjust (... , 
method = "BH"). Only 80 out of 159 genes ap- 
pearing in Gardina's list of genes up- and down- 
regulated in tumour [TH Additional file 1] passed our 
filters and were analysed for differential gene expres- 
sion. Among those analysed, the genes CLDN1, SST, 
MUSK, KIAA1199 and SLC30A10 were in the top 
100. The gene BEST4 is shown in Figure [§ This 
gene is down-regulated in tumour (blue) compared 
to normal (red) samples. The thicker lines represent 
the mean expression levels in the two groups. 

3.5 Gene annotation and visualization 

We chose to annotate transcript clusters to genes 
using the NetAffx transcript cluster annotation 
specified 



release 32 specified in Section 3.1 using the 
AnnotateGenes () function. Some transcript clusters 
present unspecific annotation and have several pos- 



Table 2: Top 10 differentially expressed genes from 
^ with corrected p-values. Gene BEST4, high- 
lighted in gray, appears in Figure |6] 

sible associated gene names. We have decided to re- 
move such clusters from our output as we cannot map 
them uniquely to a gene and afterwards interpret the 
result according to the gene structure. 

After gene annotation, the user can select genes for 
visual inspection. Visual inspection of candidates for 
differential splicing is recommended by Affymetrix as 
a way to identify possible false positives [3S] . Plots of 
gene profiles with integrated genomic information are 
obtained using the biomaRt 48 and GenomeGraphs 
[49] packages in Bioconductor. 

We use bioMart to connect to the latest version of 
the Homo Sapiens dataset in Ensembl 50J (Ensembl 
^genes 68, GRCh37.p8 at the time of writing) and re- 
trieve genes by their HGNC symbol [SI]. Gene and 
exon structures are imported from Ensembl. Gene 
and Exon objects are created by makeGeneO and 
makeTranscript () from GenomeGraphs. We store 
expression data and probeset start and stop posi- 
tions in an ExonArray object by makeExonArray () . 
The final plot is created passing a list with the 
objects created to the gdPlot (list (exon, gene, 
transcript ,...)) function. 

Our plots show on top the gene HGNC symbol fol- 
lowed by (+) for genes on the forward strand and by 
(-) for those on the reverse strand. Below, the plot 
of probeset intensities appears with vertical lines de- 
limiting probesets. Note that for models based on 
probe-level data (ANOSVA probe, FIRMA and dif- 
ferential expression), the intensities of all probes in 
the probeset (1 to 4) are shown. Samples from the 
same treatment group appear in the same colour, red 
for normal samples and in blue for tumour samples in 
this case. Immediately after the profile plot, the gene 
model retrieved from Ensembl is shown in orange, 
followed by the possible transcript model(s) in blue. 
The gene model consists of the exons that appear in 
all possible transcript models. Exons (boxes) in the 
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gene model are linked by blue lines to the probesets 
above, indicating which probeset(s) interrogate which 
exon. See Figures [2] to [6] 

4 Discussion 

The aim of this paper was to give a tutorial on how 
to perform a complete and reproducible analysis of 
exon array data in R/Bioconductor. We have worked 
with three packages: aroma. affymetrix, biomaRt 
and GenomeGraphs to go from .CEL files to intens- 
ity data, statistical analysis, annotation and visualiz- 
ation. The packages were chosen for their flexibility 
and easy of integration. We believe that our work- 
flow covers a number of analysis variants for the exon 
array, including differential splicing analysis at probe 
and probcsct-lcvcl and differential expression analysis 
at probe level, and gives the user the opportunity to 
focus on all or only some of the aspects of the data 
analysis. We make our entire code available so that 
other researchers can use it as it is or adapt it to their 
needs. 

Different Bioconductor packages could have been 
used in some of the analysis steps. For example, 
xmapcore [54 provides annotation data and cross- 
mappings between genetic features as transcript 
clusters or exons and Affymetrix probesets. This 
package, however, requires the separate installation 
of a MySQL database, which makes this a more com- 
plex alternative than the one we have chosen. The 
xps package [55J could have been used for data pre- 
processing and summarization, but it requires the in- 
stallation of the ROOT framework [56 , and a certain 
level of understanding of ROOT files and ROOT trees 
is recommended. Our workflow does not require any 
prior knowledge beyond R/Bioconductor. Other free 
software includes BRB- Array Tools [57], based on R, 
C, Fortran and Java, with an Excel front end, and 
dChip [55], which is written in Visual C++ and de- 
veloped for Windows - though some users have been 
able to run it on Mac and Unix computers. 

A previous paper on exon arrays [Hj suggests a 
pragmatic approach and does the analysis piecewise 
starting with Affymetrix Power Tools (APT) and 
then exporting the data to R. We recognise this is a 
fix for the lack of straightforward packages for deal- 
ing with the exon array in Bioconductor. However, 
it implies working with several pieces of software so 
we do not find it fit for reproducible research. Li- 
censed software, like Partek [JO] or GeneSpring GX 
[55] , has been used in other studies [BUI [22] ■ In con- 
trast to licensed software, R/Bioconductor is free and 
available for anyone, it allows the user to control 
most analysis options and it enables customizable 
and reproducible analyses that are more easily re- 



viewed. Still, the aroma. affymetrix package does 
not provide the speed of Affymetrix Power Tools or 
the licensed software, and it requires more user input. 
Nevertheless, with this code and minimal user input, 
any dataset can be analysed regarding differential ex- 
pression at probe level and differential splicing using 
the ANOSVA model. 

Differential gene expression could have been ana- 
lysed using the gene level estimates obtained from the 
plmTr object in other R packages such as limma [53) . 
Another extension of the workflow could include a 
general analysis strategy of the FIRMA scores, which 
in this study was tailor-made for a two-treatment 
scenario. Besides, the profile plots we obtained with 
GenomeGraphs, though highly informative, are diffi- 
cult to interpret for genes spanning over a long gen- 
omic region, as for example TGFBI in Figure [3] and 
ZAK in Figure [5] In the future, it would be inter- 
esting to study the flexibility of the output imported 
from Ensembl and, for example, remove the intronic 
regions from the gene and transcript models. 
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[5tJ] . The (-) next to the gene name indicates that it is on the reverse strand. The mean intensities of each 
group are plotted with a thicker line. Exons 5 (mapped by probesets 4 and 5) and 8 (mapped by probeset 
9), counted from the 5' end, seem to be higher expressed in tumour samples (blue) than in normal (red). 
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Figure 4: Profile plot of gene LGALS4 with the gene model and transcripts retrieved from Ensembl (50j . 
the gene is on the reverse strand. Interestingly, this gene only has one transcript. We might then be facing 
a novel splicing event or, more likely, a false positive detected by the method. 




Figure 5: Profile plot of gene ZAK with the gene model and transcripts retrieved from Ensembl [SO]- There 
seems to be a differential splicing event identified by probesets 5 and 6 (counted from the 3' end of the gene), 
corresponding to exons 3 and 4 from the 3' end. 
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Figure 6: Profile plot of gene BEST4, which is on the reverse strand, with the gene model and transcripts 
retrieved from Ensembl |50j . The gene is under expressed in tumour (blue) compared to normal (red) 
samples. The thin lines are the expression levels for each sample, while the thicker lines represent the mean 
intensities in each of the two groups. 
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